Reinforcement Learning with Unsupervised Auxiliary Tasks

Authors

  • Max Jaderberg
  • Volodymyr Mnih
  • Wojciech Czarnecki
  • Tom Schaul
  • Joel Z. Leibo
  • David Silver
  • Koray Kavukcuoglu
∗ Joint first authors; ordered alphabetically by first name.

Abstract

Deep reinforcement learning agents have achieved state-of-the-art results by directly maximising cumulative reward. However, environments contain a much wider variety of possible training signals. In this paper, we introduce an agent that also maximises many other pseudo-reward functions simultaneously by reinforcement learning. All of these tasks share a common representation that, like unsupervised learning, continues to develop in the absence of extrinsic rewards. We also introduce a novel mechanism for focusing this representation upon extrinsic rewards, so that learning can rapidly adapt to the most relevant aspects of the actual task. Our agent significantly outperforms the previous state-of-the-art on Atari, averaging 880% expert human performance, and on a challenging suite of first-person, three-dimensional Labyrinth tasks, where it achieves a mean 10× speedup in learning and averages 87% expert human performance.

Natural and artificial agents live in a stream of sensorimotor data. At each time step t, the agent receives observations o_t and executes actions a_t. These actions influence the future course of the sensorimotor stream. In this paper we develop agents that learn to predict and control this stream by solving a host of reinforcement learning problems, each focusing on a distinct feature of the sensorimotor stream. Our hypothesis is that an agent that can flexibly control its future experiences will also be able to achieve any goal with which it is presented, such as maximising its future rewards.

The classic reinforcement learning paradigm focuses on the maximisation of extrinsic reward. However, in many interesting domains, extrinsic rewards are only rarely observed, which raises the question of what and how to learn in their absence. Even when extrinsic rewards are frequent, the sensorimotor stream contains an abundance of other possible learning targets. Traditionally, unsupervised learning attempts to reconstruct these targets, such as the pixels in the current or subsequent frame, and is typically used to accelerate the acquisition of a useful representation. In contrast, our learning objective is to predict and control features of the sensorimotor stream, by treating them as pseudo-rewards for reinforcement learning. Intuitively, this set of tasks is more closely matched with the agent's long-term goals, potentially leading to more useful representations.

Consider a baby that learns to maximise the cumulative amount of red that it observes. To correctly predict the optimal value, the baby must understand how to increase "redness" by various means, including manipulation (bringing a red object closer to the eyes), locomotion (moving in front of a red object), and communication (crying until the parents bring a red object). These behaviours are likely to recur for many other goals that the baby may subsequently encounter. No such understanding is required to simply reconstruct the redness of current or subsequent images.

Our architecture uses reinforcement learning to approximate both the optimal policy and optimal value function for many different pseudo-rewards. It also makes other auxiliary predictions that serve to focus the agent on important aspects of the task. These include the long-term goal of predicting cumulative extrinsic reward as well as short-term predictions of extrinsic reward. To learn more efficiently, our agents use an experience replay mechanism to provide additional updates.
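To make this concrete, here is a minimal PyTorch sketch of the idea, not the authors' implementation: a shared encoder is trained by the main actor-critic loss together with two auxiliary losses, one-step Q-learning on a pseudo-reward and classification of the extrinsic reward's sign. The network sizes, the loss weights, and the crude mean-pixel-change pseudo-reward are illustrative assumptions; the paper's agent additionally uses an LSTM, per-cell pixel-control rewards with a deconvolutional Q head, skewed replay sampling for reward prediction, and extra value-function replay, all omitted here.

```python
# Minimal sketch of an UNREAL-style shared representation with auxiliary
# tasks. Sizes, weights, and the pseudo-reward are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnrealSketch(nn.Module):
    """Shared encoder with a main actor-critic head and two auxiliary heads."""
    def __init__(self, n_actions: int, obs_channels: int = 3):
        super().__init__()
        # Shared representation: every task's gradient flows through it.
        self.encoder = nn.Sequential(
            nn.Conv2d(obs_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        feat = 32 * 9 * 9                          # assumes 84x84 observations
        self.policy = nn.Linear(feat, n_actions)   # main task: actor
        self.value = nn.Linear(feat, 1)            # main task: critic
        self.aux_q = nn.Linear(feat, n_actions)    # auxiliary control: Q-values
                                                   # for a pseudo-reward
        self.reward_pred = nn.Linear(feat, 3)      # reward sign: neg/zero/pos

    def forward(self, obs):
        h = self.encoder(obs)
        return self.policy(h), self.value(h), self.aux_q(h), self.reward_pred(h)

def pseudo_reward(obs, next_obs):
    # Illustrative pseudo-reward: mean absolute pixel change. The paper
    # instead uses per-cell pixel changes with a deconvolutional Q head.
    return (next_obs - obs).abs().mean(dim=(1, 2, 3))

def unreal_loss(net, obs, next_obs, actions, returns, reward_sign,
                gamma=0.99, w_aux=0.05, w_rp=1.0):
    logits, value, aux_q, rp_logits = net(obs)
    # Main loss: advantage actor-critic (policy gradient + value regression).
    logp = F.log_softmax(logits, dim=-1)
    adv = (returns - value.squeeze(-1)).detach()
    pg_loss = -(logp.gather(1, actions[:, None]).squeeze(1) * adv).mean()
    v_loss = F.mse_loss(value.squeeze(-1), returns)
    # Auxiliary control: one-step Q-learning on the pseudo-reward.
    with torch.no_grad():
        _, _, next_q, _ = net(next_obs)
        q_target = pseudo_reward(obs, next_obs) + gamma * next_q.max(1).values
    aux_loss = F.mse_loss(aux_q.gather(1, actions[:, None]).squeeze(1), q_target)
    # Auxiliary prediction: classify the sign of the extrinsic reward
    # (in the paper, on frames sampled from a skewed replay buffer).
    rp_loss = F.cross_entropy(rp_logits, reward_sign)
    return pg_loss + v_loss + w_aux * aux_loss + w_rp * rp_loss
```

Because every head backpropagates into the same encoder, the auxiliary tasks continue to shape the shared representation even during stretches where the extrinsic reward is zero.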


Similar Papers


Loss is its own Reward: Self-Supervision for Reinforcement Learning

Reinforcement learning optimizes policies for expected cumulative reward. Need the supervision be so narrow? Reward is delayed and sparse for many tasks, making it a difficult and impoverished signal for end-to-end optimization. To augment reward, we consider a range of self-supervised tasks that incorporate states, actions, and successors to provide auxiliary losses. These losses offer ubiquito...


Learning to Shoot Goals: Analysing the Learning Process and the Resulting Policies

Reinforcement learning is a very general unsupervised learning mechanism. Due to its generality, reinforcement learning does not scale very well for tasks that involve inferring subtasks, in particular when the subtasks are dynamically changing and the environment is adversarial. One of the most challenging reinforcement learning tasks so far has been the 3 vs. 2 keepaway task in the RoboCup simu...


Deep Semi-Supervised Learning with Linguistically Motivated Sequence Labeling Task Hierarchies

In this paper we present a novel neural network algorithm for conducting semi-supervised learning for sequence labeling tasks arranged in a linguistically motivated hierarchy. This relationship is exploited to regularise the representations of supervised tasks by backpropagating the error of the unsupervised task through the supervised tasks. We introduce a neural network where lower layers are ...


A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning

Significant plasticity in sensory cortical representations can be driven in mature animals either by behavioural tasks that pair sensory stimuli with reinforcement, or by electrophysiological experiments that pair sensory input with direct stimulation of neuromodulatory nuclei, but usually not by sensory stimuli presented alone. Biologically motivated theories of representational learning, howe...



Journal:
  • CoRR

Volume: abs/1611.05397 (arXiv:1611.05397v1 [cs.LG], 16 Nov 2016)  Issue: –

Pages: –

Publication date: 2016